Fully Non-Parametric Feature Selection Models for Text Categorization
Abstract
As an important preprocessing technology in text categorization (TC), feature selection can improve the scalability, efficiency, and accuracy of a text classifier [1]. In this paper, we propose a two-stage TC model: a fully non-parametric feature selection model. This classification model first selects features using non-parametric statistical methods and then applies non-parametric classifiers to the selected features for categorization. The improvements rest on two ideas: non-parametric models require no prior knowledge of the data's probability distribution, and feature selection can improve classification efficiency and accuracy. We identify a competitive fully non-parametric feature selection model: an artificial neural network (ANN) applied to the features selected by the χ² statistic (CHI). On real-world datasets, this model achieves better classification accuracy and requires fewer features than the state-of-the-art classifier (support vector machine). The ANN-with-CHI classification model is therefore well suited to data with an unknown underlying probability distribution. We also find that two statistical feature selection methods, the Kendall rank correlation coefficient and the Spearman rank correlation, are strongly correlated in TC feature selection.
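The two-stage pipeline described above (CHI feature selection followed by an ANN classifier) can be illustrated with a minimal scikit-learn sketch. The toy corpus, the number of selected features (k=10), and the MLP settings below are illustrative assumptions, not the paper's actual experimental setup.

```python
# Minimal sketch: chi-square (CHI) feature selection, then an ANN classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

# Tiny illustrative corpus (assumption; the paper uses real-world TC datasets).
docs = [
    "the striker scored a late goal in the match",
    "the team won the league after a penalty shootout",
    "the central bank raised interest rates again",
    "stock markets fell on weak earnings reports",
]
labels = ["sports", "sports", "finance", "finance"]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),             # non-negative term weights, as chi2 requires
    ("chi", SelectKBest(chi2, k=10)),         # keep the 10 terms with the highest CHI score
    ("ann", MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)),
])

pipeline.fit(docs, labels)
print(pipeline.predict(["the goalkeeper saved the final penalty"]))
```

In this sketch, chi2 scores each term against the class labels, SelectKBest keeps only the highest-scoring terms, and MLPClassifier (a feed-forward ANN) is trained on the reduced feature set, mirroring the "select first, classify second" structure of the proposed model.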
Similar Articles
Improving the Operation of Text Categorization Systems with Selecting Proper Features Based on PSO-LA
With the explosive growth in the amount of information, tools and methods are needed to search, filter, and manage resources. One of the major problems in text classification is the high dimensionality of the feature space. Therefore, a main goal in text classification is to reduce the dimensionality of the feature space. There are many feature selection methods. However...
An Improved Flower Pollination Algorithm with AdaBoost Algorithm for Feature Selection in Text Documents Classification
In recent years, the production of text documents has grown exponentially, which is why their proper classification is necessary for better access. One of the main problems in classifying text documents is working in a high-dimensional feature space. Feature Selection (FS) is one of the ways to reduce the number of text attributes. So, working with a great bulk of the feature spa...
Optimally Combining Positive and Negative Features for Text Categorization
This paper presents a novel local feature selection approach for text categorization. It constructs a feature set for each category by first selecting a set of terms highly indicative of membership as well as another set of terms highly indicative of non-membership, then unifying the two sets. The size ratio of the two sets was empirically chosen to obtain optimal performance. This is in contra...
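The idea of unifying membership-indicative and non-membership-indicative terms can be sketched roughly as follows. The correlation-based scoring, the CountVectorizer preprocessing, and the equal sizes of the two sets are assumptions for illustration; the paper's actual scoring function and its empirically tuned size ratio are not reproduced here.

```python
# Rough sketch of per-category (local) feature selection that unifies terms
# indicative of membership with terms indicative of non-membership.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def local_feature_set(docs, binary_labels, n_pos=50, n_neg=50):
    """Union of the strongest positive and negative terms for one category."""
    vec = CountVectorizer(binary=True)
    X = vec.fit_transform(docs).toarray().astype(float)
    y = np.asarray(binary_labels, dtype=float)
    # Correlation of each term's presence with category membership
    # (illustrative scoring; not necessarily the paper's measure).
    scores = np.nan_to_num(
        np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    )
    terms = vec.get_feature_names_out()
    positive = terms[np.argsort(scores)[::-1][:n_pos]]  # most indicative of membership
    negative = terms[np.argsort(scores)[:n_neg]]        # most indicative of non-membership
    return set(positive) | set(negative)

# Hypothetical usage with a tiny spam-vs-not-spam category.
docs = ["cheap pills online", "meeting agenda attached",
        "win money now", "project status update"]
labels = [1, 0, 1, 0]
print(local_feature_set(docs, labels, n_pos=3, n_neg=3))
```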
A New Approach for Text Documents Classification with Invasive Weed Optimization and Naive Bayes Classifier
With the rapid increase in the number of documents, using Text Document Classification (TDC) methods has become a crucial matter. This paper presents a hybrid model of Invasive Weed Optimization (IWO) and a Naive Bayes (NB) classifier (IWO-NB) for Feature Selection (FS), in order to reduce the large size of the feature space in TDC. TDC includes different actions such as text processing, feature extraction, form...